AITopics | training data selection

Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLM

In this work, we introduce Entropy Area Score (EAS), a simple yet effective metric to quantify uncertainty in the answer generation process of reasoning large language models (LLMs). EAS requires neither external models nor repeated sampling; instead, it integrates token-level predictive entropy from the model itself to capture the evolution of uncertainty during generation. Empirical results show that EAS is strongly correlated with answer entropy across models and datasets. In training data selection, EAS identifies high-potential samples and consistently outperforms Pass Rate filtering under equal sample budgets, improving student model accuracy on math benchmarks. EAS is both efficient and interpretable, offering a practical tool for uncertainty modeling and data quality assessment in LLM training.
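The abstract describes integrating token-level predictive entropy over the course of generation. The sketch below is one plausible reading of that idea, not the paper's definition: it assumes Shannon entropy in nats per generated token and a trapezoidal rule with unit spacing between token positions. The function names `token_entropy` and `entropy_area` are hypothetical, and any normalization or answer-span restriction the authors may apply is not modeled here.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_area(step_probs: Sequence[Sequence[float]]) -> float:
    """Area under the per-token entropy curve.

    Assumption: trapezoidal rule with unit spacing between token positions.
    A sequence with fewer than two tokens has no interval, so its area is 0.
    """
    entropies = [token_entropy(p) for p in step_probs]
    if len(entropies) < 2:
        return 0.0
    return sum((a + b) / 2.0 for a, b in zip(entropies, entropies[1:]))


if __name__ == "__main__":
    # Toy distributions over a 3-token vocabulary (made-up numbers).
    confident = [[0.98, 0.01, 0.01]] * 5
    uncertain = [[0.4, 0.3, 0.3]] * 5
    print(entropy_area(confident), entropy_area(uncertain))
```

Under these assumptions, a generation whose next-token distributions stay peaked accumulates a smaller area than one whose distributions stay flat, which is the kind of signal the abstract proposes for ranking training samples.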

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.20384

Genre: Research Report > New Finding (0.88)

Industry: Education (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Sliced-Wasserstein Distance-based Data Selection

Pallage, Julien, Lesage-Landry, Antoine

arXiv.org Artificial Intelligence, Apr-18-2025

We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is interesting for decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we open the first dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
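For readers unfamiliar with the distance this abstract builds on, here is a minimal Monte Carlo sketch of the sliced-Wasserstein distance between two equal-size point sets. It illustrates only the distance itself, not the authors' filtering method or either of their two approximations. The function names, the number of projections, the fixed seed, and the equal-size requirement are all assumptions made for this illustration.

```python
import math
import random
from typing import List, Sequence


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_wasserstein(
    xs: Sequence[Sequence[float]],
    ys: Sequence[Sequence[float]],
    n_projections: int = 200,
    p: int = 2,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the sliced-Wasserstein-p distance.

    Assumption: both point sets are non-empty and have the same size, so each
    1D Wasserstein distance reduces to comparing sorted projections pairwise.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("point sets must be non-empty and of equal size")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        # Project every point onto the random direction, then sort.
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        # 1D Wasserstein-p^p between two equal-size empirical measures.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)


if __name__ == "__main__":
    base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    shifted = [(x + 3.0, y + 4.0) for x, y in base]
    print(sliced_wasserstein(base, base), sliced_wasserstein(base, shifted))
```

Each projection reduces the problem to one dimension, where optimal transport between equal-size samples is solved by sorting; averaging over random directions is what keeps the cost low compared with solving a full transport problem.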

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2504.12918

Country: North America (0.14)

Genre: Research Report (0.82)

Industry:

Energy > Power Industry (1.00)
Automobiles & Trucks (1.00)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection

Reid, Mirabel, Sweeney, Christine, Korobkin, Oleg

arXiv.org Artificial Intelligence, Aug-22-2024

Most machine learning models require many iterations of hyper-parameter tuning, feature engineering, and debugging to produce effective results. As machine learning models become more complicated, this pipeline becomes more difficult to manage effectively. In the physical sciences, there is an ever-increasing pool of metadata that is generated by the scientific research cycle. Tracking this metadata can reduce redundant work, improve reproducibility, and aid in the feature and training dataset engineering process. In this case study, we present a tool for machine learning metadata management in dynamic radiography. We evaluate the efficacy of this tool against the initial research workflow and discuss extensions to general machine learning pipelines in the physical sciences.

metadata management, radiography machine learning workflow, training data selection

arXiv.org Artificial Intelligence

2408.12655

Genre:

Research Report (0.69)
Workflow (0.60)

Industry:

Health & Medicine > Nuclear Medicine (0.60)
Health & Medicine > Diagnostic Medicine > Imaging (0.60)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems, Apr-6-2023, 17:13:39 GMT

In this paper, we consider the problem of active learning in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness of the proposed active learning method.

optimal generalization, training data selection, trigonometric polynomial network, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)

Sub-Setting Algorithm for Training Data Selection in Pattern Recognition

Arwade, AGaurav, Olafsson, Sigurdur

arXiv.org Machine Learning, Oct-13-2021

Modern pattern recognition tasks use complex algorithms that take advantage of large datasets to make more accurate predictions than traditional algorithms such as decision trees or k-nearest-neighbor, which are better suited to describing simple structures. While increased accuracy is often crucial, less complexity also has value. This paper proposes a training data selection algorithm that identifies multiple subsets with simple structures. A learning algorithm trained on such a subset can classify an instance belonging to the subset with better accuracy than the traditional learning algorithms. In other words, while existing pattern recognition algorithms attempt to learn a global mapping function to represent the entire dataset, we argue that an ensemble of simple local patterns may better describe the data. Hence the sub-setting algorithm identifies multiple subsets with simple local patterns by identifying similar instances in the neighborhood of an instance. This motivation has similarities to that of gradient boosted trees but focuses on model explainability, which boosted trees lack. The proposed algorithm thus balances accuracy and explainable machine learning by identifying a limited number of subsets with simple structures. We applied the proposed algorithm to the international stroke dataset to predict the probability of survival. Our bottom-up sub-setting algorithm performed on average 15% better than the top-down decision tree learned on the entire dataset. The decision trees learned on the identified subsets use some features that the whole-dataset decision tree leaves unused, and each subset represents a distinct population of data.

algorithm, decision tree, subset, (13 more...)

arXiv.org Machine Learning

2110.06527

Country:

North America > United States > Iowa > Story County > Ames (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.55)

Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Sugiyama, Masashi, Ogawa, Hidemitsu

Neural Information Processing Systems, Dec-31-2000

In this paper, we consider the problem of active learning in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness of the proposed active learning method.

operator, sample point, training example, (10 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
North America > United States > New York (0.04)
North America > United States > California > San Mateo County > San Mateo (0.04)
(3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.80)


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Sugiyama, Masashi, Ogawa, Hidemitsu

Neural Information Processing Systems, Dec-31-2000

In this paper, we consider the problem of active learning in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness of the proposed active learning method. 1 Introduction: Supervised learning is obtaining an underlying rule from training examples, and can be formulated as a function approximation problem. If sample points are actively designed, then learning can be performed more efficiently. In this paper, we discuss the problem of designing sample points, referred to as active learning, for optimal generalization. Active learning is classified into two categories depending on the optimality. One is global optimal, where a set of all training examples is optimal (e.g.

operator, sample point, training example, (9 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
North America > United States > New York (0.04)
North America > United States > California > San Mateo County > San Mateo (0.04)
(3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLM
